Abstract

We propose a Multi-Layer Network based on the Bayesian framework of Factor Graphs
in Reduced Normal Form (FGrn) applied to a two-dimensional lattice. The Latent Variable
Model (LVM) is the basic building block of a quadtree hierarchy built on top of a bottom
layer of random variables that represent pixels of an image, a feature map, or more generally
a collection of spatially distributed discrete variables. The multi-layer architecture
implements a hierarchical data representation that, via belief propagation, can be used for
learning and inference. Typical uses are pattern completion, correction and classification.
The FGrn paradigm provides great flexibility and modularity and appears to be a promising
candidate for building deep networks: the system can be easily extended by introducing
new and different (in cardinality and in type) variables. Prior knowledge, or supervised
information, can be introduced at different scales. The FGrn paradigm provides a handy way
of building all kinds of architectures by interconnecting only three types of units: Single
Input Single Output (SISO) blocks, Sources and Replicators. The network is designed like
a circuit diagram and the belief messages flow bidirectionally through the whole system. The
learning algorithms operate only locally within each block. The framework is demonstrated
in this paper with a three-layer structure applied to images extracted from a standard data
set.

Keywords: Bayesian Networks, Factor Graphs, Deep Belief Networks 


1. Introduction 


Building efficient representations for images, and more generally for sensory data, is one
of the central issues in signal processing. The problem has received much attention in the
literature of the last thirty years because, almost invariably, the extraction of information
from observations requires that raw data first be translated into “feature maps” before
classification or filtering.

Recent striking results with “deep networks” have generated much attention in machine
learning on what is known as Representation Learning (see (Bengio et al., 2012) for a review).
The main idea of these methods is to learn multiple representation levels as progressive
abstractions of the input data. The creation of a feature hierarchy allows the structure
inside the data to emerge at different scales, combining more and more complex features as
we go upward in the hierarchy (Bengio and Delalleau, 2011), (Bengio et al., 2014).

For image understanding this process is somewhat biologically plausible too. There is
a vast literature that postulates the hierarchical organization of the primary visual cortex
(V1). The neurons become selective for stimuli that are increasingly complex, from simple
oriented bars and edges to moderately complex features, such as a combination of
orientations, to complex objects (Serre and Poggio, 2010). We do not derive our models from
biology, but we cannot avoid recognizing that the most successful artificial system
paradigms share some common features with what is observed in nature.

In building an artificial system, one of the key issues is to provide sufficiently general
methods that can be applied across different kinds of sensory data, letting learning capture
most of the specificity of the application context. This is why in this work we focus on a
Bayesian network approach, which has the advantage of being totally general with respect
to the type of data processed, defining a framework that can easily fuse information coming
from different sensor modalities. In a Bayesian network the information flow is bi-directional
via belief propagation and can easily accommodate various kinds of inference for pattern
completion, correction and classification.


Various architectures have been proposed as adaptive Bayesian graphs (Koller and Friedman,
2009), (Barber, 2012), but in our case the use of Factor Graphs (Forney, 2001),
(Loeliger, 2004), especially in the simplified Reduced Normal Form (Palmieri, 2013), allows
better modularity. Message propagation follows standard sum-product rules, but the system
is built as the interconnection of only SISO blocks, source blocks and replicators, with
learning equations defined in a totally localized fashion.

In this paper we propose a new deep architecture based on FGrn applied to a two-dimensional
lattice. The Latent Variable Model (LVM) (Murphy, 2012), (Bishop, 1999),
also known as Autoclass (Cheeseman and Stutz, 1996), is the basic building block of a
quadtree hierarchy. Learning is totally localized inside the SISO blocks that constitute
the LVMs. The complete system can be seen as a partitioned type of Latent Tree Model
(Mourad et al., 2013).


The application of the Bayesian model to images shows how the hierarchy extracts the 
primitives at various scales and how, via bi-directional belief propagation, it provides a 
reliable structure for learning and inference in various modes. 

In Section 2 we review some of the related literature, while in Section 3 we introduce
notations and the basics of belief propagation in FGrn. In Section 4 we present the LVM,
i.e. the building block for the multi-layer architecture that is presented in Section 5 with
the learning strategy and the Encoding/Decoding process. In Section 6 we apply learning
and inference to images from a standard data set. Section 7 includes concluding remarks
and suggestions for further work.


2. Related Work 

The vast literature on deep representation learning (see the extensive overview in
(Schmidhuber, 2015)) can be mostly divided into two main lines of research: the first one is
based on probabilistic graphical models such as the Restricted Boltzmann Machine (RBM)
(Hinton et al., 2006), (Hinton and Salakhutdinov, 2006), (Lee et al., 2008) and the second
one is based on neural network models such as the autoencoder (Bengio et al., 2007), (Ranzato
et al., 2006). At the same time several unsupervised feature learning algorithms have been
proposed: Sparse Coding (Olshausen and Field, 1996), (Lee et al., 2006), Autoencoders
(Bengio et al., 2007), (Ranzato et al., 2006), RBM (Hinton et al., 2006) and K-means
(Coates and Ng). Other models based on the memory-prediction
theory of brain have also been proposed (Hawkins, 2004), (Dileep, 2008).


Confining our interest to probabilistic graphical models, the most natural choice for
modeling the spatial interactions between pixels (or patches) in the image is a two-dimensional
lattice (Markov Random Field - MRF) where the nodes represent the pixels (or patches)
and the potential functions are associated with the edges between adjacent nodes (Wainwright
and Jordan, 2008). Various tasks in image processing, such as denoising, segmentation and
super-resolution, can be treated as an inference problem on the MRF. For these models
convergence of the inference is not guaranteed and, since exact inference is intractable for
large-scale models, approximate and sub-optimal methods have often been used: Markov Chain
Monte Carlo methods (Geman and Geman, 1984), (Gelfand and Smith, 1990), variational methods
(Jordan et al., 1999), (Beal, 2003), graph cut (Boykov et al., 1999) and Belief Propagation
(Xiong et al., 2007).


An alternative strategy to the MRF is to replace the 2D lattice with a simpler, approximate
model such as a multiscale (or multiresolution) structure like the quadtree. These structures
have the advantage of allowing the application of efficient tree algorithms to perform exact
inference, with the trade-off that the model is imperfect (Luettgen et al., 1993), (Bouman and
Shapiro, 1994), (Nowak, 1999), (Laferte et al., 2000), (Willsky, 2002). Another problem of
the quadtree structure is its non-locality, since two neighboring pixels may or may not share
a common parent node depending on their position on the grid. To avoid this problem,
Wolf et al. have proposed a Markov cube that adds connections at the different
levels (Wolf and Gavin, 2010).


On the quadtree structure inference can be performed using the belief propagation algorithm,
which was originally proposed for inference on trees, where exact solutions are guaranteed
(Pearl, 1988). When the graph has loops, open issues still remain about the accuracy
of inferences, even though the bare application of standard belief propagation may often
already provide satisfactory results (loopy belief propagation) (Yedidia et al., 2005), (Frean,
2008). When the problem can be reduced to a tree, belief propagation provides exact
marginalization, and algorithms for learning latent trees have been proposed (Choi et al.,
2011) with successful applications to computer vision.


A very appealing approach to directed Bayesian graphs, which has not yet found its full way
into applications, is the Factor Graph (FG) representation and in particular the so-called
Normal Form (FGn) (Forney, 2001), (Loeliger, 2004).
This formulation provides an easy way to visualize and manipulate Bayesian graphs, much
like block diagrams. Factor Graphs assign variables to
edges and functions to nodes. Furthermore, in the Reduced Normal Form (FGrn), through
the use of replicator units (or equality constraints), the graph is reduced to an architecture
in which each variable is connected to at most two factors (Palmieri, 2013). In this way
any architecture (deep or shallow) can be built as the interconnection of only three types of
units: Single Input Single Output (SISO) blocks, Sources and Diverters (Replicators), with
the learning equations defined locally (Figure 1).

This is the framework on which this paper is focused because the designed network
resembles a circuit diagram with belief messages more easily visualized as they flow into
SISO blocks and travel through replicator nodes (Buonanno and Palmieri, 2014). This
paradigm provides extensive modularity because replicators act like buses and can be used
as expansion nodes when we need to augment an existing model with new variables.
Parameter learning, in this representation, can be approached in a unified way because we can
concentrate on a unique rule for training any SISO, or Source, factor-block in the system,
regardless of its location (visible or hidden).

In our previous work (Palmieri and Buonanno, 2014) we have reported some preliminary 
results on a multi-layer convolution Bayesian Factor Graph built as a stack of HMM-like 
trees. Each layer is built from a latent model trained on the messages coming from the 
layer below. The structure is loopy, but our experience has shown that BP performs well 
in recovering information from the deep parts of the network: the upper layers contain 
progressively larger-scale information that is pipelined to the bottom for pattern completion 
or correction across sequences. 

In this work we step back and confine our attention to a quadtree structure, for which
no loops are present and inference is exact. We have found that this framework, even if
just a tree, has great potential for use in a very large number of applications thanks to its
inherent modularity, at the expense of a certain growth in computational complexity. The
complexity issue will be discussed in the paper. To our knowledge the FGrn framework has
never been used to build deep networks.


3. Factor Graphs in Reduced Normal Form 


In the FGrn framework (Palmieri, 2013) the Bayesian graph is reduced to a simplified form
composed only of Variables, Replicators (or Diverters), Single-Input/Single-Output (SISO)
blocks and Source blocks. Even though various architectures have been proposed in the
literature for Bayesian graphs (Loeliger, 2004), we have found that the FGrn framework
is much easier to handle, is more suitable for defining unique learning equations (Palmieri,
2013) and is better suited to distributed implementations. The blocks needed to compose
any architecture are shown in Figure 1. In our notation we avoid the upper arrows for
the messages and assign a direction to each variable branch for unambiguous definition of
forward and backward messages.


For a variable $X$ (Figure 1(a)) that takes values in the discrete alphabet
$\mathcal{X} = \{\xi_1, \xi_2, \ldots, \xi_{d_X}\}$, forward and backward messages are in
function form $b_X(\xi_i)$ and $f_X(\xi_i)$, $i = 1 : d_X$, and in vector form
$\mathbf{b}_X = (b_X(\xi_1), b_X(\xi_2), \ldots, b_X(\xi_{d_X}))^T$ and
$\mathbf{f}_X = (f_X(\xi_1), f_X(\xi_2), \ldots, f_X(\xi_{d_X}))^T$.
All messages are proportional ($\propto$) to discrete distributions and may be normalized to
sum to one.

Comprehensive knowledge about $X$ is contained in the posterior distribution $p_X$,
obtained through the product rule, $p_X(\xi_i) \propto f_X(\xi_i)\, b_X(\xi_i)$, $i = 1 : d_X$,
in function form, or $\mathbf{p}_X \propto \mathbf{f}_X \odot \mathbf{b}_X$, in vector form,
where $\odot$ denotes the element-by-element product. The
result of each product is proportional to a distribution and can be normalized to sum to one
(it is good practice to keep messages normalized to avoid poorly conditioned products).
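
The product rule can be illustrated with a few lines of Python. This is only a minimal sketch: the helper names `normalize` and `posterior` and the numeric messages are illustrative assumptions, not part of the original framework description.

```python
def normalize(v):
    """Scale a non-negative vector so that it sums to one."""
    total = sum(v)
    if total <= 0:
        raise ValueError("message has no positive mass")
    return [x / total for x in v]


def posterior(forward, backward):
    """Product rule: p_X is proportional to the element-by-element product f_X * b_X."""
    return normalize([f * b for f, b in zip(forward, backward)])


# Hypothetical messages on a three-symbol alphabet.
f_x = [0.5, 0.3, 0.2]  # forward message (for instance a prior)
b_x = [0.1, 0.1, 0.8]  # backward message (for instance soft evidence)
p_x = posterior(f_x, b_x)
```

Normalizing after every product, as done here, is the practice recommended above for keeping the products well conditioned.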


Figure 1: FGrn components: (a) a variable branch; (b) a diverter; (c) a SISO block; (d) a 
source block. 


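The replicator (or diverter) (Figure 1(b)) represents the equality constraint with the
variable $X$ replicated $(D+1)$ times. Messages for incoming and outgoing branches carry
different forward and backward information. Messages that leave the block are obtained as the
product of the incoming ones:
$b_{X^{(0)}}(\xi_i) \propto \prod_{j=1}^{D} b_{X^{(j)}}(\xi_i)$;
$f_{X^{(k)}}(\xi_i) \propto f_{X^{(0)}}(\xi_i) \prod_{j=1, j \neq k}^{D} b_{X^{(j)}}(\xi_i)$,
$k = 1 : D$, $i = 1 : d_X$, in function form.
In vector form: $\mathbf{b}_{X^{(0)}} \propto \bigodot_{j=1}^{D} \mathbf{b}_{X^{(j)}}$;
$\mathbf{f}_{X^{(k)}} \propto \mathbf{f}_{X^{(0)}} \odot \bigodot_{j=1, j \neq k}^{D} \mathbf{b}_{X^{(j)}}$,
$k = 1 : D$.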
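
As a sketch of the replicator rule, the following Python fragment computes the outgoing messages of a diverter with one incoming forward message and $D$ incoming backward messages. The function name `replicator_messages` and the example values are assumptions made for illustration only.

```python
from math import prod


def normalize(v):
    """Scale a non-negative vector so that it sums to one."""
    total = sum(v)
    if total <= 0:
        raise ValueError("message has no positive mass")
    return [x / total for x in v]


def replicator_messages(f0, backs):
    """Outgoing messages of a replicator.

    f0    : forward message entering on branch 0
    backs : list of the D backward messages entering on branches 1..D
    Returns (b0, forwards): b0 is the product of all backward messages,
    forwards[k] is f0 times the product of all backward messages except the k-th.
    """
    d = len(f0)
    b0 = normalize([prod(b[i] for b in backs) for i in range(d)])
    forwards = []
    for k in range(len(backs)):
        out = [f0[i] * prod(b[i] for j, b in enumerate(backs) if j != k)
               for i in range(d)]
        forwards.append(normalize(out))
    return b0, forwards


# Hypothetical binary variable replicated on three outgoing branches.
f0 = [0.5, 0.5]
backs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
b0, forwards = replicator_messages(f0, backs)
```

Each outgoing message excludes the message that arrived on its own branch, which is what lets information from one branch reach the others without being counted twice.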

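The SISO block (Figure 1(c)) represents the conditional probability matrix of $Y$ given
$X$. More specifically, if $X$ takes values in the discrete alphabet
$\mathcal{X} = \{\xi_1, \xi_2, \ldots, \xi_{d_X}\}$ and $Y$ in
$\mathcal{Y} = \{\upsilon_1, \upsilon_2, \ldots, \upsilon_{d_Y}\}$, $P(Y|X)$ is the
$d_X \times d_Y$ row-stochastic matrix
$P(Y|X) = [\Pr\{Y = \upsilon_j | X = \xi_i\}]_{i=1:d_X}^{j=1:d_Y} = [\theta_{ij}]_{i=1:d_X}^{j=1:d_Y}$.
Outgoing messages are: $f_Y(\upsilon_j) \propto \sum_{i=1}^{d_X} \theta_{ij} f_X(\xi_i)$;
$b_X(\xi_i) \propto \sum_{j=1}^{d_Y} \theta_{ij} b_Y(\upsilon_j)$, in function form.
In vector form: $\mathbf{f}_Y \propto P(Y|X)^T \mathbf{f}_X$;
$\mathbf{b}_X \propto P(Y|X)\, \mathbf{b}_Y$.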
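
The two SISO rules amount to a matrix-vector product with the conditional probability matrix or its transpose. A minimal sketch follows, where the function names and the $2 \times 3$ matrix are hypothetical and chosen only for illustration.

```python
def normalize(v):
    """Scale a non-negative vector so that it sums to one."""
    total = sum(v)
    if total <= 0:
        raise ValueError("message has no positive mass")
    return [x / total for x in v]


def siso_forward(theta, f_x):
    """Forward message: f_Y proportional to P(Y|X)^T f_X."""
    d_y = len(theta[0])
    return normalize([sum(theta[i][j] * f_x[i] for i in range(len(f_x)))
                      for j in range(d_y)])


def siso_backward(theta, b_y):
    """Backward message: b_X proportional to P(Y|X) b_Y."""
    return normalize([sum(t * b for t, b in zip(row, b_y)) for row in theta])


# Hypothetical row-stochastic matrix with d_X = 2 and d_Y = 3.
theta = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6]]
f_y = siso_forward(theta, [1.0, 0.0])        # X known to be the first symbol
b_x = siso_backward(theta, [0.0, 0.0, 1.0])  # Y observed as the third symbol
```

With a delta on $X$ the forward message is simply the corresponding row of the matrix, and with a delta on $Y$ the backward message is the normalized corresponding column.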

The source block in Figure 1(d) is the termination for the independent source variable
$X$. More specifically, $\pi_X$ is the $d_X$-dimensional prior distribution on $X$ with the
outgoing message $f_X(\xi_i) = \pi_X(\xi_i)$, $i = 1 : d_X$, in function form, or
$\mathbf{f}_X = \pi_X$ in vector form. The
backward message $\mathbf{b}_X$ coming from the network can be combined with the forward
$\mathbf{f}_X$ for the final posterior on $X$.

For the reader not familiar with the factor graph framework, it should be emphasized
that the above rules are a rigorous translation of Bayes’ theorem and marginalization. For a
more detailed review, refer to our recent works (Palmieri, 2013), (Buonanno and Palmieri,
2014) (or to the classical papers (Loeliger, 2004), (Kschischang et al., 2001)).


Parameters (probabilities) in the SISO and the source blocks must be learned from
examples, relying solely on the backward and forward flows available locally. We set the
learning problem as an EM algorithm to maximize global likelihood (Palmieri, 2013). Focusing on
a specific SISO block, if all the other network parameters have been fixed, maximization of
global likelihood translates into the local problem from examples
$(\mathbf{f}_{X[n]}, \mathbf{b}_{Y[n]})$, $n = 1, \ldots, N_e$:

$$\min_{\theta} \; -\sum_{n=1}^{N_e} \log\left(\mathbf{f}_{X[n]}^T \, \theta \, \mathbf{b}_{Y[n]}\right), \qquad \theta \ \text{row-stochastic}. \qquad (1)$$

After adding a stabilizing term to the cost function and applying KKT conditions we obtain
the following algorithm (Palmieri, 2013).


Algorithm 1 Learning Algorithm for SISO block
procedure LEARNING ALGO
    Initialize $\theta$ to uniform rows: $\theta = (1/d_Y)\,\mathbf{1}_{d_X \times d_Y}$
    for $i = 1 : d_X$ do
        $ftmp(i) = \sum_{n=1}^{N_e} f_{X[n]}(i)$
    end for
    for $it = 1 : N_{it}$ do
        for $n = 1 : N_e$ do
            $den(n) = \mathbf{f}_{X[n]}^T \, \theta \, \mathbf{b}_{Y[n]}$
        end for
        for $i = 1 : d_X$ do
            for $j = 1 : d_Y$ do
                $tmpSum = \sum_{n=1}^{N_e} f_{X[n]}(i)\, b_{Y[n]}(j) / den(n)$
                $\theta_{ij} \leftarrow \frac{\theta_{ij}}{ftmp(i)} \cdot tmpSum$
            end for
        end for
        Row-normalize $\theta$
    end for
end procedure

(We have used the shortened notation $f_{X[n]}(\xi_i) = f_{X[n]}(i)$, $b_{Y[n]}(\upsilon_j) = b_{Y[n]}(j)$.)
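
A compact Python rendering of the multiplicative update may help in following the algorithm. It is a sketch under stated assumptions: the names `learn_siso` and `neg_log_likelihood` are hypothetical, the denominator is taken to be the likelihood term of the cost (1), the guards against empty rows are an addition, and the toy examples are sharp (delta) messages chosen so that the result can be checked by hand.

```python
import math


def neg_log_likelihood(theta, f_x, b_y):
    """Local cost (1): minus the sum over examples of log(f^T theta b)."""
    return -sum(
        math.log(sum(f[i] * theta[i][j] * b[j]
                     for i in range(len(f)) for j in range(len(b))))
        for f, b in zip(f_x, b_y))


def learn_siso(f_x, b_y, n_iter=20):
    """Multiplicative update of a row-stochastic d_X x d_Y matrix theta.

    f_x : list of N_e forward messages (each of length d_X)
    b_y : list of N_e backward messages (each of length d_Y)
    """
    n_e, d_x, d_y = len(f_x), len(f_x[0]), len(b_y[0])
    theta = [[1.0 / d_y] * d_y for _ in range(d_x)]          # uniform rows
    f_tmp = [sum(f[i] for f in f_x) for i in range(d_x)]
    for _ in range(n_iter):
        # Likelihood of each example under the current theta.
        den = [sum(f_x[n][i] * theta[i][j] * b_y[n][j]
                   for i in range(d_x) for j in range(d_y))
               for n in range(n_e)]
        new_theta = []
        for i in range(d_x):
            if f_tmp[i] == 0.0:            # row never excited: leave it unchanged
                new_theta.append(theta[i][:])
                continue
            row = []
            for j in range(d_y):
                tmp_sum = sum(f_x[n][i] * b_y[n][j] / den[n] for n in range(n_e))
                row.append(theta[i][j] / f_tmp[i] * tmp_sum)
            total = sum(row)
            new_theta.append([v / total for v in row] if total > 0 else theta[i][:])
        theta = new_theta
    return theta


# Toy data: observed (x, y) pairs (0,0), (0,0), (0,1), (1,2) as delta messages.
f_examples = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
b_examples = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta_hat = learn_siso(f_examples, b_examples)
uniform = [[1.0 / 3.0] * 3 for _ in range(2)]
```

With sharp examples the update reduces to counting co-occurrences, so the learned rows are the empirical conditional frequencies; with soft messages the same loop performs the EM-style reweighting.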


In Algorithm 1 there are three main blocks and the complexity in the worst case is
$O(N_e \cdot d_X \cdot d_Y \cdot N_{it})$. The algorithm is a fast multiplicative update with
no free parameters. The
iterations usually converge in a few steps and the presence of the normalizing factors makes
the algorithm numerically very stable. The algorithm has been discussed and compared to
other similar updates in (Palmieri, 2013).


The updates for the source block are immediate if we set the forward messages
$\mathbf{f}_{X[n]}$ to a uniform distribution and consider any row of $\theta$ to be the
target distribution.


4. Bayesian Clustering 

For the architectures that will follow, the basic building block is the Latent-Variable Model
(LVM) shown in Figure 2. At the bottom of each LVM there are $N \cdot M$ variables $X[n,m]$,
$n = 1 : N$, $m = 1 : M$, that belong to a finite alphabet
$\mathcal{X} = \{\xi_1, \xi_2, \ldots, \xi_{d_X}\}$. The
variables are organized here on a plane (as an image) because they will compose the layers
of a multi-layer architecture. The $N \cdot M$ variables code multiple discrete labels, which in
the application that follows take values in the same alphabet, but they could easily have
different cardinalities if we need to fuse information coming from different sources (the
combination of heterogeneous variables is one of the most powerful features of
the FGrn paradigm). Generally the complexity of the whole system increases with the
cardinality of the alphabets.
Figure 2: An $(N \cdot M)$-tuple with the Latent Variable as (a) a Bayesian graph and (b) a
Factor Graph in Reduced Normal Form. Only the first SISO block has
been explicitly described.


The $N \cdot M$ bottom variables are connected to one Hidden (Latent) Variable $S$, that
belongs to the finite alphabet $\mathcal{S} = \{\sigma_1, \sigma_2, \ldots, \sigma_{d_S}\}$.
The replicator block of Figure 2(b) is
drawn here as a box as it will be a patch of the upper plane of each layer in our architecture.
Each connection to the bottom layer is a SISO block that represents the $d_S \times d_X$
row-stochastic probability matrix

$$P(X[n,m]|S) = \begin{bmatrix}
P(X[n,m] = \xi_1 | S = \sigma_1) & \cdots & P(X[n,m] = \xi_{d_X} | S = \sigma_1) \\
P(X[n,m] = \xi_1 | S = \sigma_2) & \cdots & P(X[n,m] = \xi_{d_X} | S = \sigma_2) \\
\vdots & & \vdots \\
P(X[n,m] = \xi_1 | S = \sigma_{d_S}) & \cdots & P(X[n,m] = \xi_{d_X} | S = \sigma_{d_S})
\end{bmatrix}.$$



The system is drawn as a generative model with the arrows pointing down, assuming that the
source is variable $S$ and the bottom variables are its children. This architecture can also be
seen as a Mixture of Categorical Distributions (Koller and Friedman, 2009). Each element of
the alphabet $\mathcal{S}$ represents a “Bayesian cluster” for the $N \cdot M$ dimensional
stochastic image, $X = [X[n,m]]_{n=1:N}^{m=1:M}$ (similar to the Naive Bayes classifier
(Barber, 2012)). Essentially
each bottom variable is independent of the others given the Hidden Variable (Koller
and Friedman, 2009). One way to visualize the model is to imagine drawing a sample: for
each data point we draw a cluster index
$s \in \mathcal{S} = \{\sigma_1, \sigma_2, \ldots, \sigma_{d_S}\}$ according to the prior
distribution $\pi_S$. Then for each $n = 1 : N$, $m = 1 : M$, we draw $x[n,m] \in \mathcal{X}$
according to $P(X[n,m]|S = s)$.
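
The sampling procedure just described can be sketched in Python as follows. The function name `sample_lvm`, the nested-list layout of the conditional matrices and the deterministic toy parameters are assumptions for illustration only.

```python
import random


def sample_lvm(prior_s, cond, rng):
    """Draw one sample from a Latent Variable Model.

    prior_s    : prior pi_S over the d_S clusters
    cond[n][m] : d_S x d_X row-stochastic matrix P(X[n,m] | S)
    rng        : a random.Random instance
    Returns the cluster index s and the N x M patch of symbol indices.
    """
    s = rng.choices(range(len(prior_s)), weights=prior_s)[0]
    patch = [[rng.choices(range(len(mat[s])), weights=mat[s])[0] for mat in row]
             for row in cond]
    return s, patch


# Toy model: 2 clusters over a 2 x 2 binary patch; cluster k always emits symbol k.
deterministic = [[1.0, 0.0], [0.0, 1.0]]
cond = [[deterministic, deterministic], [deterministic, deterministic]]
s, patch = sample_lvm([0.5, 0.5], cond, random.Random(0))
```

Because the pixels are conditionally independent given the cluster index, each pixel is drawn separately from the row of its own matrix selected by $s$.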

We can perform exact inference simply by letting the messages propagate and collecting
the results. Information can be injected at any node and inference can be obtained for each
variable using the usual sum-product rules. For each SISO block of Figure 2 the incoming
messages ($\mathbf{b}_X$ and $\mathbf{f}_S$) and the outgoing messages ($\mathbf{f}_X$ and
$\mathbf{b}_S$) flow simultaneously following
the rules outlined in the previous section (sum rule). In the replicator block, incoming
messages from all directions are combined with the product rule to produce outgoing messages.
We can imagine the replicator block as acting like a bus where information is combined and
diverted towards the connected branches.
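
For a single LVM the whole propagation can be written out directly. The sketch below is an illustration under assumptions: the $N \cdot M$ bottom variables are flattened into a list, the function name `lvm_inference` is hypothetical, and the numeric model is a toy example.

```python
from math import prod


def normalize(v):
    """Scale a non-negative vector so that it sums to one."""
    total = sum(v)
    if total <= 0:
        raise ValueError("message has no positive mass")
    return [x / total for x in v]


def lvm_inference(prior_s, cond, back):
    """Sum-product inference in one LVM.

    prior_s : prior pi_S (forward message entering the replicator from the source)
    cond[k] : d_S x d_X matrix P(X_k | S) of the k-th SISO block
    back[k] : backward message injected at the k-th bottom variable
    Returns the posterior on S and the forward messages at the bottom variables.
    """
    d_s = len(prior_s)
    # Sum rule in each SISO block: b_S^(k) proportional to P(X_k|S) b_{X_k}.
    up = [normalize([sum(p * b for p, b in zip(cond[k][s], back[k]))
                     for s in range(d_s)])
          for k in range(len(cond))]
    # Product rule at the replicator gives the posterior on S.
    post_s = normalize([prior_s[s] * prod(u[s] for u in up) for s in range(d_s)])
    forwards = []
    for k in range(len(cond)):
        # Message sent down branch k: prior times all upward messages except the k-th.
        down = normalize([prior_s[s] * prod(u[s] for j, u in enumerate(up) if j != k)
                          for s in range(d_s)])
        d_x = len(cond[k][0])
        forwards.append(normalize([sum(down[s] * cond[k][s][x] for s in range(d_s))
                                   for x in range(d_x)]))
    return post_s, forwards


# Toy model: 2 clusters, 3 binary variables; cluster 0 favors symbol 0, cluster 1 symbol 1.
block = [[0.9, 0.1], [0.1, 0.9]]
post_s, forwards = lvm_inference(
    [0.5, 0.5],
    [block, block, block],
    [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]])  # two observed pixels, one unknown
```

In this toy case the two observed pixels push the posterior towards the first cluster, and the forward message at the unobserved pixel reports the resulting prediction for it.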

Handling information in the Bayesian architecture is very flexible since each variable
$X[n,m]$ corresponds to a pair of messages. The backward message coming from below
is propagated upward towards the latent variable and, through the diverter, towards the
sibling branches downwards to the forward messages at the terminations. At the same
time the latent variable, fed through its forward message from above, sends information
downward through the diverter.

5. Multi-layer FGrn 

In this work we build a multilayer structure as in Figure 3(a) on top of a bottom layer
of random variables. They can be pixels of an image, a feature map, or more generally a
collection of spatially distributed discrete variables. In the following we refer to the bottom
variables as the Image.

The architecture that lies on top of the Image is the quadtree. In Figure 3(a) the cyan
spheres are the image variables and the other ones (red, green and blue) are the embedding
(latent or hidden) variables of the LVM blocks. In Figure 3(b) the same architecture is
represented as a FGrn.

A network with $L + 1$ levels (Layer 0, ..., Layer $L$) covers a bottom image (Layer 0)
$S_0[n,m]$, $n = 1 : N \cdot 2^{L-1}$, $m = 1 : M \cdot 2^{L-1}$, subdivided in
$(2^{L-1}) \cdot (2^{L-1})$ image patches
of dimension $(N \cdot M)$. At Layer 1 each patch is managed by one of the latent variables
$S_1[n,m]$, $n = 1 : 2^{L-1}$, $m = 1 : 2^{L-1}$, of cardinality $d_{S_1}$. At Layer 2 each
latent variable $S_2[n,m]$, $n = 1 : 2^{L-2}$, $m = 1 : 2^{L-2}$, with dimension $d_{S_2}$,
is connected to 4 variables of
Layer 1. Similarly, we climb the tree in quadruples up to the root with variable $S_L$.
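
The layer sizes implied by this construction can be tabulated with a short helper. The function name `quadtree_shapes` is hypothetical, and the example uses $N = M = 8$ and $L = 3$, the values reported in the simulation section.

```python
def quadtree_shapes(n, m, top_layer):
    """Return {layer: (rows, cols)} of the variable grid for Layers 0..L.

    n, m      : size of the basic patch managed by one Layer 1 variable
    top_layer : index L of the root layer
    """
    shapes = {0: (n * 2 ** (top_layer - 1), m * 2 ** (top_layer - 1))}
    for i in range(1, top_layer + 1):
        shapes[i] = (2 ** (top_layer - i), 2 ** (top_layer - i))
    return shapes


shapes = quadtree_shapes(8, 8, 3)
```

For these values the bottom image is 32 by 32 pixels, Layer 1 holds a 4 by 4 grid of latent variables, Layer 2 a 2 by 2 grid, and Layer 3 the single root.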

Messages travel within each layer (among the subset of LVM variables) and with the 
layers above and below (within the connected patches and quadruples). The architecture 
builds a hierarchical Bayesian clustering of the information that is exchanged across different 
representation scales. 

5.1 Inference Modes 

If the network parameters have been learned, the system can be used in the following main 
inference modes: 

Generative: A latent variable $S_i[n,m]$ is fixed to a value $\sigma_k^i$, i.e. its forward
distribution is a delta $f_{S_i}(s) = \delta(s - \sigma_k^i)$,
$\sigma_k^i \in \mathcal{S}_i = \{\sigma_1^i, \sigma_2^i, \ldots, \sigma_{d_{S_i}}^i\}$.
After downward message propagation,
the forward messages at the terminal variables $S_0[n,m]$ in the cone subtended by $S_i$ reveal
the k-th “hierarchical conditional distribution” associated with $S_i$. This generation could be
done on Layer 1 to check for clusters in the image patches, or at higher layers to visualize
the coding role of the various hierarchical representations.

Of course propagation can be done from a generic node also upward with a backward 
delta distribution. The complete upward and downward flow, up to the tree root and down 
to the other terminations, would reveal the role of that specific node in the representation 
memorized in the system (a sort of impulse response). 

Encoding: The image $S_0 = S_0[n,m]$, $n = 1 : N \cdot 2^{L-1}$, $m = 1 : M \cdot 2^{L-1}$, is
known and the values of the bottom variables are injected in the backward messages as delta
distributions. After all messages have been propagated for a number of steps equal to the
network diameter (in this case $2 \cdot L + 1$), at each hidden variable $S_i[n,m]$,
$i = 1, \ldots, L$, $n = 1 : 2^{L-i}$, $m = 1 : 2^{L-i}$, we find the contribution of the
observations to the posterior for
$S_i[n,m]$. The exact posterior on $S_i[n,m]$ is obtained as the normalized product of forward
and backward messages. Each hidden variable represents one of the components of the
(soft) code of the image.

Figure 3: (a) The quadtree architecture; (b) Reduced Normal Factor Graph representation
of the quadtree architecture with 4 layers (0-3).

Pattern Completion: Only some of the bottom variables of $S_0$ are known, i.e. their
backward messages are deltas. For the unknown variables the backward messages are uniform
distributions. In this modality, after at least $2 \cdot L + 1$ steps, the network returns
forward messages at the terminal variables that try to complete the pattern (associative
recall, or content-addressed memory).

Error Correction: Some of the bottom variables of $S_0$ are known softly, or wrongly,
with smooth distributions, or delta functions respectively. After at least $2 \cdot L + 1$ steps
of message propagation, the network produces forward messages at the terminations that
attempt to correct the distributions, or reduce the uncertainty. The posterior distributions
at the terminal variables $S_0[n,m]$ are obtained as the normalized product of forward and
backward messages.

Before propagation, all messages that do not correspond to evidence are initialized to
uniform distributions.
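
The way evidence enters the network in the modes above can be sketched as follows: known pixels become delta backward messages and unknown pixels become uniform ones. The function name `evidence_messages` and the use of `None` to mark an erased pixel are conventions assumed here for illustration.

```python
def evidence_messages(image, alphabet_size):
    """Build the backward messages injected at the bottom layer.

    image         : 2D list of symbol indices, with None for unknown pixels
    alphabet_size : cardinality d_{S_0} of the bottom variables
    Known pixels give a delta distribution, unknown pixels a uniform one.
    """
    def delta(k):
        return [1.0 if i == k else 0.0 for i in range(alphabet_size)]

    uniform = [1.0 / alphabet_size] * alphabet_size
    return [[list(uniform) if px is None else delta(px) for px in row]
            for row in image]


def propagation_steps(top_layer):
    """Number of steps equal to the network diameter, 2 * L + 1."""
    return 2 * top_layer + 1


# Hypothetical 2 x 2 binary image with one erased pixel.
messages = evidence_messages([[0, 1], [None, 1]], 2)
steps = propagation_steps(3)
```

Error correction would use the same layout, with smooth distributions in place of the deltas for the pixels that are only softly known.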


5.2 Learning 


The parameters contained in the FG are learned from a training set of $T$ images
$S_0^1, \ldots, S_0^T$.

We assume that within each layer the LVM blocks share the same parameters. This is 
a standard shift-invariance assumption that most deep belief networks make. 

Given a basic patch of dimension $N \cdot M$ pixels and a network with $L + 1$ levels (Layer
0, ..., Layer $L$), for Layer 1 we need to learn $N \cdot M$ matrices $P(S_0[n,m]|S_1)$ (one
per pixel), each one of size $d_{S_1} \times d_{S_0}$, and the $d_{S_1}$-dimensional prior
vector $\pi_{S_1}$.

For Layers 2 to $L$ we need to learn 4 matrices $P(S_{i-1}[n,m]|S_i)$ of size
$d_{S_i} \times d_{S_{i-1}}$ and the $d_{S_i}$-dimensional prior vector $\pi_{S_i}$.
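
To give a feeling for the model size, the number of stored probabilities per layer can be counted directly from these dimensions, assuming (as stated above) that all LVM blocks within a layer share their parameters. The function name `parameter_counts` is hypothetical; the cardinalities in the example are those reported in the simulation section.

```python
def parameter_counts(n, m, cards):
    """Count stored probabilities per layer (matrix entries plus prior entries).

    n, m  : basic patch size
    cards : [d_S0, d_S1, ..., d_SL], cardinalities of Layers 0..L
    """
    counts = {}
    for i in range(1, len(cards)):
        branches = n * m if i == 1 else 4    # SISO blocks hanging from one latent variable
        counts[i] = branches * cards[i] * cards[i - 1] + cards[i]
    return counts


counts = parameter_counts(8, 8, [2, 100, 300, 300])
total = sum(counts.values())
```

The count includes every matrix entry, so it overstates the number of free parameters slightly because each row is constrained to sum to one.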

A generic image of the training set is subdivided in $L$-Level patches of dimension
$(2^{L-1} \cdot N) \cdot (2^{L-1} \cdot M)$ pixels. Each $L$-Level patch is subdivided in
$2 \cdot 2$ $(L-1)$-Level patches, $2^2 \cdot 2^2$ $(L-2)$-Level patches and so on, until we
obtain $(2^{L-1}) \cdot (2^{L-1})$ patches of dimension $(N \cdot M)$
pixels at Layer 1 (0-Level and 1-Level patches are the same).

The examples, subdivided in 1st-Level patches, are presented to the termination of
the LVM in Fig. 2 as sharp backward distributions for a fixed number of steps (epochs).
All SISO blocks and the source block adapt their parameters using the iterative Maximum
Likelihood Algorithm (Palmieri, 2013) outlined in Section 3.

Once Layer 1 is learned, the 2nd-Level patches are used to learn Layer 2, constructed by
combining 4 LVMs of Layer 1, and the process goes on, building a deeper and deeper network
and considering larger and larger patches.

At the end of the learning phase the matrices are frozen and used in one of the inference 
modes described before on the same training set to check for accuracy and on a test set to 
check for generalization. 

Figure 4: Learning Steps for a 4-Layers Architecture: (a) Learning Layer 1; (b) Learning
Layer 2; (c) Learning Layer 3

More specifically, learning is off-line and it is composed of the following steps for an
architecture of $L + 1$ layers and a basic patch dimension of $N \cdot M$ pixels:

1. We randomly select $P$ $L$-Level Patches from each image in the Training Set composed
of $T$ images. Therefore, in the learning phase, we have $T \cdot P$ $L$-Level patches that
are subdivided in Patches of the lower levels until we obtain 1st-Level Patches;

2. The $T \cdot P \cdot (2^{L-1} \cdot 2^{L-1})$ 1st-Level Patches of $(N \cdot M)$ pixels are
injected at the bottom of the LVM and the parameters are learned (Figure 4(a));

3. A new 3-Layers network (0-2) is built replicating $2 \cdot 2$ times the LVM block learned
above and connecting their Hidden Variables with another LVM Block;

4. The $T \cdot P \cdot (2^{L-2}) \cdot (2^{L-2})$ Patches of $(2 \cdot N) \cdot (2 \cdot M)$
pixels are injected at the bottom
of the new 3-Layers network and the backward messages at the top of Layer 1 are
used to learn the LVM block at Layer 2 (Figure 4(b));

5. A new 4-Layers network (0-3) is built replicating $2^2 \cdot 2^2$ times the LVM block
learned at step 2 and $2 \cdot 2$ times the LVM block learned at step 4 and connecting
their Hidden Variables to another LVM Block;

6. The $T \cdot P \cdot (2^{L-3}) \cdot (2^{L-3})$ Patches of $(2^2 \cdot N) \cdot (2^2 \cdot M)$
pixels are propagated through
Layer 1 and Layer 2 and the backward messages at the top of Layer 2 are used
to learn the LVM block at Layer 3 (Figure 4(c));

7. The same progression is applied to all the other layers, extending the number of LVM
block replicas to cover the dimension of the current-Level Patch.
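
The patch bookkeeping used in the steps above can be sketched in Python. Only the recursive subdivision is shown, not the learning itself; the function names `split_patch` and `patches_at_level` and the row-major ordering of the quadrants are assumptions made for illustration.

```python
def split_patch(patch):
    """Split a 2D list with even side lengths into its 2 x 2 quadrants (row-major order)."""
    h, w = len(patch) // 2, len(patch[0]) // 2
    return [[row[c:c + w] for row in patch[r:r + h]]
            for r in (0, h) for c in (0, w)]


def patches_at_level(patch, top_level, level):
    """Subdivide a top_level patch into the patches of a lower level."""
    patches = [patch]
    for _ in range(top_level - level):
        patches = [q for p in patches for q in split_patch(p)]
    return patches


# A 32 x 32 patch (L = 3, N = M = 8) whose pixels are numbered for easy checking.
big = [[r * 32 + c for c in range(32)] for r in range(32)]
level1 = patches_at_level(big, 3, 1)   # patches of 8 x 8 pixels
level2 = patches_at_level(big, 3, 2)   # patches of 16 x 16 pixels
```

With $T \cdot P$ top-level patches, applying this subdivision to each of them yields the numbers of lower-level patches quoted in steps 2, 4 and 6.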
Figure 5: Some images from the Training Set


6. Simulations 

In this set of simulations we have taken 50 car images from the Caltech101 Dataset. Each image
is cropped, filtered with an anisotropic diffusion algorithm (Kovesi), whitened and finally
filtered with a Canny filter in order to obtain images with only the car borders. Our input
alphabet is binary ($d_{S_0} = 2$). From the 50 filtered images a set of 500 image patches of
$32 \cdot 32$ pixels is randomly extracted. A small subset is shown in Fig. 5.


6.1 Learning 

The steps for the learning phase are described in the previous section and use the following
values: $P = 10$, $T = 50$, $N = 8$, $M = 8$, $L = 3$, $d_{S_0} = 2$, $d_{S_1} = 100$,
$d_{S_2} = 300$, $d_{S_3} = 300$.


6.2 Inference 

Once the matrices have been learned we use the network in various inference modes: 

Generative mode: We obtain forward distributions at the bottom terminations by injecting
a delta distribution at the top of the various structures (the images in gray scale show
at each pixel the probability of one of the two symbols). More specifically, for visualizing
the conditional distributions corresponding to Layer 1 we consider only the Latent Model
in Figure 2; for Layer 2 we consider the 3-Layers architecture composed of 4 LVM Blocks
connected to one LVM Block; for Layer 3 we consider the complete architecture. Figures 6, 7
and 8 show respectively the forward distributions generated by injecting deltas at Layers 1,
2 and 3.

The network has stored the complex structures quite well. The forward distributions
from Layer 1 represent simple orientation patterns similar to the ones that the early human
visual system responds to. At Layer 3 the distributions reflect the combined representations
stored at larger scales.

Figure 7: 100 of 300 forward distributions learned at the second level. Dimension of the
Embedding Space: 300

Figure 8: 100 of 300 forward distributions learned at the third level. Dimension of the
Embedding Space: 300

Pattern Completion: In these experiments we have used the architecture as an associative
memory that is queried with incomplete patterns and responds with the information stored
during the learning phase. Figure 9(a) shows 20 patches extracted from the Training Set. Before
the injection at the bottom of the network, a considerable number of pixels (16 patches out
of a total of 32 patches) has been erased (Figure 9(b)), i.e. the delta backward distribution is
replaced with a uniform distribution. Figure 9(c) shows the forward distributions for the
same images after message propagation. The network is able to resolve most of
the uncertainties quite well using the stored information.

The same Pattern Completion experiment has been repeated with patches extracted
from the Test Set (patterns that the network has never seen before). The result is, as
expected, worse, as shown in Figure 10, but it is worth noting that the network succeeds quite
well in completing most of the shapes by applying the learned knowledge (test for generalization).
Cross-validation can be easily applied to determine the most appropriate embedding space
sizes for best generalization.

7. Discussion and Conclusions 

From the results presented above we have demonstrated that the paradigm of the FGrn can 
be successfully applied to build deep architectures. The layers retain the information about 
the clusters contained in the data and build a hierarchical internal representation. Each 
layer successfully learns to compose the objects made available from the lower layers. 
Figure 9: Completion of Patterns obtained from the Training Set. (a) Original image, (b)
Image with many erasures (in gray), (c) Image inferred by the network.

Figure 10: Completion of Patterns obtained from the Test Set. (a) Original image, (b)
Image with some erasures (in gray), (c) Image inferred by the network.

We have chosen the borders of car images extracted from Caltech101 because we wanted
to see if the paradigm was suitable for patching together the salient structures of an object.
Other experiments have been performed on characters and different patterns, revealing very
similar results.

We believe that the FGrn paradigm constitutes a promising addition to the various 
proposals for deep networks that are appearing in the literature. It can provide great 
flexibility and modularity. The network can be easily extended by introducing new and 
different (in cardinality and in type) variables. Prior knowledge and supervised information 
can be inserted at any of the scales: new “label variables” can be added in one or more of 
the diverter junctions and let learning take care of parameter adaptation. Results of these 
mixed supervised/unsupervised architectures are under way and will be reported elsewhere. 

The computational complexity issue that clearly emerges from this paradigm, especially
when the embedding variables have large dimensionality and when image pixels are
non-binary, is being addressed through parallel implementations. Since both belief propagation
and learning are totally local, they can be implemented with distributed hardware or parallelized
processes. Some studies have been carried out for other deep network frameworks (Liang
et al., 2009), (Silberstein et al., 2008), (Ma et al., 2012) and we are confident that similarly
the FGrn paradigm may present new interesting opportunities to approach some of the
most challenging tasks in computer vision.